Son Nguyen
Classification is the process of finding the decision boundary that best separates two classes. A candidate split of a node is scored by its impurity gain:
\[ IG = I_{parent} - \frac{N_{left}}{N}I_{left}-\frac{N_{right}}{N}I_{right} \]
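where the impurity \( I \) of a node with class fractions \( p_0 \) and \( p_1 \) can be measured in three ways:
\[ \begin{aligned} \text{By classification error: } I &= \min\{p_0, p_1\} \\ \text{By Gini index: } I &= 1 - p_0^2 - p_1^2 \\ \text{By entropy: } I &= -p_0 \log_2(p_0) - p_1 \log_2(p_1) \end{aligned} \]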
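The formulas above translate directly into a few lines of Python. The sketch below is illustrative only: the function names and the `(class 0 count, class 1 count)` representation of a node are assumptions made for this example, and the parent node is taken to be the union of the two children.

```python
import math

def classification_error(p0, p1):
    """Classification-error impurity: the share of the minority class."""
    return min(p0, p1)

def gini(p0, p1):
    """Gini index of a two-class node."""
    return 1 - p0 ** 2 - p1 ** 2

def entropy(p0, p1):
    """Entropy in bits; a class with probability 0 contributes 0."""
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

def impurity_gain(impurity, left, right):
    """IG = I_parent - (N_left / N) * I_left - (N_right / N) * I_right.

    `left` and `right` are (class 0 count, class 1 count) pairs
    (a hypothetical representation chosen for this sketch).
    """
    parent = (left[0] + right[0], left[1] + right[1])
    n = sum(parent)

    def node_impurity(counts):
        total = sum(counts)
        return impurity(counts[0] / total, counts[1] / total)

    return (node_impurity(parent)
            - sum(left) / n * node_impurity(left)
            - sum(right) / n * node_impurity(right))

# Split 1 of the worked example: one point goes left, four go right.
ig_split1 = impurity_gain(classification_error, left=(0, 1), right=(3, 1))
print(round(ig_split1, 3))  # 0.2
```

Passing `gini` or `entropy` instead of `classification_error` scores the same split under the other two criteria.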
By classification error, for Split 1: \( N = 5, N_{left} = 1, N_{right} = 4 \)
Parent node, A: \( p_0 = \frac{3}{5}, p_1 = \frac{2}{5} \). Thus, \( I_{A} = \min(\frac{3}{5}, \frac{2}{5}) = \frac{2}{5} \)
Left child node, L: \( p_0 = \frac{0}{1} = 0, p_1 = \frac{1}{1} = 1 \). Thus, \( I_{L} = \min(0, 1) = 0 \)
Right child node, R: \( p_0 = \frac{3}{4}, p_1 = \frac{1}{4} \). Thus, \( I_{R} = \min(\frac{3}{4}, \frac{1}{4}) = \frac{1}{4} \)
Impurity Gain of Split 1:
\[ IG = \frac{2}{5} - \frac{1}{5} \cdot 0-\frac{4}{5} \cdot \frac{1}{4} = 0.2 \]
For Split 2: \( N = 5, N_{left} = 2, N_{right} = 3 \)
Parent node, A: \( p_0 = \frac{3}{5}, p_1 = \frac{2}{5} \). Thus, \( I_{A} = \min(\frac{3}{5}, \frac{2}{5}) = \frac{2}{5} \)
Left child node, L: \( p_0 = \frac{1}{2}, p_1 = \frac{1}{2} \). Thus, \( I_{L} = \min(\frac{1}{2}, \frac{1}{2}) = \frac{1}{2} \)
Right child node, R: \( p_0 = \frac{2}{3}, p_1 = \frac{1}{3} \). Thus, \( I_{R} = \min(\frac{2}{3}, \frac{1}{3}) = \frac{1}{3} \)
Impurity Gain of Split 2:
\[ IG = \frac{2}{5} - \frac{2}{5} \cdot \frac{1}{2}-\frac{3}{5} \cdot \frac{1}{3} = 0 \]
For Split 3: \( N = 5, N_{left} = 3, N_{right} = 2 \)
Parent node, A: \( p_0 = \frac{3}{5}, p_1 = \frac{2}{5} \). Thus, \( I_{A} = \min(\frac{3}{5}, \frac{2}{5}) = \frac{2}{5} \)
Left child node, L: \( p_0 = \frac{1}{3}, p_1 = \frac{2}{3} \). Thus, \( I_{L} = \min(\frac{1}{3}, \frac{2}{3}) = \frac{1}{3} \)
Right child node, R: \( p_0 = \frac{2}{2} = 1, p_1 = \frac{0}{2} = 0 \). Thus, \( I_{R} = \min(1, 0) = 0 \)
Impurity Gain of Split 3:
\[ IG = \frac{2}{5} - \frac{3}{5} \cdot \frac{1}{3}-\frac{2}{5} \cdot 0 = 0.2 \]
| Split | IG (classification error) |
|---|---|
| Split 1 | 0.2 |
| Split 2 | 0 |
| Split 3 | 0.2 |
By Gini index, for Split 1: \( N = 5, N_{left} = 1, N_{right} = 4 \)
Parent node, A: \( p_0 = \frac{3}{5}, p_1 = \frac{2}{5} \). Thus, \( I_{A} = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 0.48 \)
Left child node, L: \( p_0 = \frac{0}{1} = 0, p_1 = \frac{1}{1} = 1 \). Thus, \( I_{L} = 1 - 0^2 - 1^2 = 0 \)
Right child node, R: \( p_0 = \frac{3}{4}, p_1 = \frac{1}{4} \). Thus, \( I_{R} = 1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2 = 0.375 \)
Impurity Gain of Split 1:
\[ IG = 0.48 - \frac{1}{5} \cdot 0-\frac{4}{5} \cdot 0.375 = 0.18 \]
For Split 2: \( N = 5, N_{left} = 2, N_{right} = 3 \)
Parent node, A: \( p_0 = \frac{3}{5}, p_1 = \frac{2}{5} \). Thus, \( I_{A} = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 0.48 \)
Left child node, L: \( p_0 = \frac{1}{2}, p_1 = \frac{1}{2} \). Thus, \( I_{L} = 1 - \left(\frac{1}{2}\right)^2 - \left(\frac{1}{2}\right)^2 = 0.5 \)
Right child node, R: \( p_0 = \frac{2}{3}, p_1 = \frac{1}{3} \). Thus, \( I_{R} = 1 - \left(\frac{2}{3}\right)^2 - \left(\frac{1}{3}\right)^2 = \frac{4}{9} \approx 0.444 \)
Impurity Gain of Split 2:
\[ IG = 0.48 - \frac{2}{5} \cdot \frac{1}{2} - \frac{3}{5} \cdot \frac{4}{9} = \frac{1}{75} \approx 0.013 \]
For Split 3: \( N = 5, N_{left} = 3, N_{right} = 2 \)
Parent node, A: \( I_{A} = 0.48 \)
Left child node, L: \( p_0 = \frac{1}{3}, p_1 = \frac{2}{3} \). Thus, \( I_{L} = 1 - \left(\frac{1}{3}\right)^2 - \left(\frac{2}{3}\right)^2 = \frac{4}{9} \approx 0.444 \)
Right child node, R: \( p_0 = \frac{2}{2} = 1, p_1 = \frac{0}{2} = 0 \). Thus, \( I_{R} = 1 - 1^2 - 0^2 = 0 \)
Impurity Gain of Split 3:
\[ IG = 0.48 - \frac{3}{5} \cdot \frac{4}{9} - \frac{2}{5} \cdot 0 = \frac{16}{75} \approx 0.213 \]
| Split | IG (Gini index) |
|---|---|
| Split 1 | 0.18 |
| Split 2 | 0.013 |
| Split 3 | 0.213 |
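
The Gini index of the three-point nodes is exactly 4/9, so rounding it to two decimals before multiplying shifts the third decimal of the resulting gain. A minimal sketch using Python's `fractions.Fraction` keeps every step exact; the helper names and the `(class 0 count, class 1 count)` node representation are assumptions made for this illustration.

```python
from fractions import Fraction

def gini(counts):
    """Exact Gini index of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum(Fraction(c, n) ** 2 for c in counts)

def gini_gain(left, right):
    """Exact Gini impurity gain; the parent is the union of the two children."""
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)
    return (gini(parent)
            - Fraction(sum(left), n) * gini(left)
            - Fraction(sum(right), n) * gini(right))

# (class 0, class 1) counts in the left and right child of each split
splits = {
    "Split 1": ((0, 1), (3, 1)),
    "Split 2": ((1, 1), (2, 1)),
    "Split 3": ((1, 2), (2, 0)),
}
gains = {name: gini_gain(left, right) for name, (left, right) in splits.items()}
for name, gain in gains.items():
    print(name, gain, round(float(gain), 3))
# Split 1 9/50 0.18
# Split 2 1/75 0.013
# Split 3 16/75 0.213
```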
By entropy, for Split 1: \( N = 5, N_{left} = 1, N_{right} = 4 \)
Parent node, A: \( p_0 = \frac{3}{5}, p_1 = \frac{2}{5} \). Thus, \( I_{A} = -\frac{3}{5} \log_2(\frac{3}{5}) - \frac{2}{5} \log_2(\frac{2}{5}) \approx 0.971 \)
Left child node, L: \( p_0 = \frac{0}{1} = 0, p_1 = \frac{1}{1} = 1 \). Thus, \( I_{L} = 0 \) (using the convention \( 0 \cdot \log_2 0 = 0 \))
Right child node, R: \( p_0 = \frac{3}{4}, p_1 = \frac{1}{4} \). Thus, \( I_{R} = -\frac{3}{4} \log_2(\frac{3}{4}) - \frac{1}{4} \log_2(\frac{1}{4}) \approx 0.811 \)
Impurity Gain of Split 1:
\[ IG = 0.971 - \frac{1}{5} \cdot 0-\frac{4}{5} \cdot 0.811 = 0.322 \]
For Split 2: \( N = 5, N_{left} =2, N_{right} = 3 \)
Parent node, A: \( p_0 = \frac{3}{5}, p_1 = \frac{2}{5} \). Thus, \( I_{A} \approx 0.971 \)
Left child node, L: \( p_0 = \frac{1}{2}, p_1 = \frac{1}{2} \). Thus, \( I_{L} = -\frac{1}{2} \log_2(\frac{1}{2}) - \frac{1}{2} \log_2(\frac{1}{2}) = 1 \)
Right child node, R: \( p_0 = \frac{2}{3}, p_1 = \frac{1}{3} \). Thus, \( I_{R} = -\frac{2}{3} \log_2(\frac{2}{3}) - \frac{1}{3} \log_2(\frac{1}{3}) \approx 0.918 \)
Impurity Gain of Split 2:
\[ IG = 0.971 - \frac{2}{5} \cdot 1-\frac{3}{5} \cdot 0.918 = 0.02 \]
For Split 3: \( N = 5, N_{left} =3, N_{right} = 2 \)
Parent node, A: \( I_{A} \approx 0.971 \)
Left child node, L: \( p_0 = \frac{1}{3}, p_1 = \frac{2}{3} \). Thus, \( I_{L} = -\frac{1}{3} \log_2(\frac{1}{3}) - \frac{2}{3} \log_2(\frac{2}{3}) \approx 0.918 \)
Right child node, R: \( p_0 = \frac{2}{2} = 1, p_1 = \frac{0}{2} = 0 \). Thus, \( I_{R} = 0 \) (using the convention \( 0 \cdot \log_2 0 = 0 \))
Impurity Gain of Split 3:
\[ IG = 0.971 - \frac{3}{5} \cdot 0.918 - \frac{2}{5} \cdot 0 = 0.42 \]
| Split | IG (entropy) |
|---|---|
| Split 1 | 0.322 |
| Split 2 | 0.02 |
| Split 3 | 0.42 |
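
To compare the three criteria side by side, the sketch below scores every split under each impurity measure and reports which split (or splits) comes out on top. It is a hedged illustration: the function names, the dictionary layout, and the use of `math.isclose` to detect ties are choices made for this example.

```python
import math

def classification_error(p):
    return min(p)

def gini(p):
    return 1 - sum(q * q for q in p)

def entropy(p):
    # 0 * log2(0) is treated as 0, so empty classes are skipped
    return -sum(q * math.log2(q) for q in p if q > 0)

def impurity_gain(impurity, left, right):
    """Gain of a binary split; left/right hold the per-class counts of each child."""
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)

    def node(counts):
        total = sum(counts)
        return impurity([c / total for c in counts])

    return node(parent) - sum(left) / n * node(left) - sum(right) / n * node(right)

# (class 0, class 1) counts in the left and right child of each split
splits = {
    "Split 1": ((0, 1), (3, 1)),
    "Split 2": ((1, 1), (2, 1)),
    "Split 3": ((1, 2), (2, 0)),
}
criteria = {"error": classification_error, "gini": gini, "entropy": entropy}

best_by_criterion = {}
for crit_name, impurity in criteria.items():
    gains = {s: impurity_gain(impurity, l, r) for s, (l, r) in splits.items()}
    top = max(gains.values())
    # Keep every split whose gain matches the maximum, so ties stay visible.
    best_by_criterion[crit_name] = [s for s, g in gains.items() if math.isclose(g, top)]

print(best_by_criterion)
```

Classification error ties Split 1 and Split 3, while the Gini index and entropy both single out Split 3.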
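We want to test whether \( X \) and \( Y \) are independent or associated. Here \( X \) is the branch (left or right) and \( Y \) is the class (green or red).
Test statistic:
\[ \chi^2 = \sum_i \frac{(e_i-o_i)^2}{e_i}, \text{ which approximately follows a } \chi^2 \text{ distribution with } (n-1)(m-1) \text{ degrees of freedom} \]
where \( o_i \) is the observed count in cell \( i \), \( e_i \) is the count expected under independence (row total times column total, divided by \( N \)), and \( n \) and \( m \) are the numbers of rows and columns of the contingency table.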
For Split 1:

| | Greens | Reds | Total |
|---|---|---|---|
| Left Branch | 0 (Cell 1) | 1 (Cell 2) | 1 |
| Right Branch | 3 (Cell 3) | 1 (Cell 4) | 4 |
| Total | 3 | 2 | 5 |
\[ \chi^2 = \frac{(e_1-o_1)^2}{e_1}+\frac{(e_2-o_2)^2}{e_2}+\frac{(e_3-o_3)^2}{e_3}+\frac{(e_4-o_4)^2}{e_4} \]
For example, for \( i=4 \) (Cell 4): \( e_4 = \frac{4\cdot 2}{5} = 1.6 \), \( o_4 = 1 \)
Plugging in all four cells, we have: \[ \chi^2 = 1.875 \]
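As a check on the arithmetic for Split 1, the four cells can be written out in full. Each expected count is the cell's row total times its column total, divided by \( N = 5 \); only the Cell 4 values are given explicitly in the text, so the other three are derived here from the table margins:
\[ \begin{aligned} e_1 &= \frac{1 \cdot 3}{5} = 0.6, & o_1 &= 0 \\ e_2 &= \frac{1 \cdot 2}{5} = 0.4, & o_2 &= 1 \\ e_3 &= \frac{4 \cdot 3}{5} = 2.4, & o_3 &= 3 \\ e_4 &= \frac{4 \cdot 2}{5} = 1.6, & o_4 &= 1 \end{aligned} \]
\[ \chi^2 = \frac{(0.6-0)^2}{0.6}+\frac{(0.4-1)^2}{0.4}+\frac{(2.4-3)^2}{2.4}+\frac{(1.6-1)^2}{1.6} = 0.6 + 0.9 + 0.15 + 0.225 = 1.875 \]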
For Split 2:

| | Greens | Reds | Total |
|---|---|---|---|
| Left Branch | 1 (Cell 1) | 1 (Cell 2) | 2 |
| Right Branch | 2 (Cell 3) | 1 (Cell 4) | 3 |
| Total | 3 | 2 | 5 |
\[ \chi^2 = \frac{(e_1-o_1)^2}{e_1}+\frac{(e_2-o_2)^2}{e_2}+\frac{(e_3-o_3)^2}{e_3}+\frac{(e_4-o_4)^2}{e_4} \]
For example, for \( i=4 \) (Cell 4): \( e_4 = \frac{3\cdot 2}{5} = 1.2 \), \( o_4 = 1 \)
Plugging in all four cells, we have: \[ \chi^2 = 0.139 \]
For Split 3:

| | Greens | Reds | Total |
|---|---|---|---|
| Left Branch | 1 (Cell 1) | 2 (Cell 2) | 3 |
| Right Branch | 2 (Cell 3) | 0 (Cell 4) | 2 |
| Total | 3 | 2 | 5 |
\[ \chi^2 = \frac{(e_1-o_1)^2}{e_1}+\frac{(e_2-o_2)^2}{e_2}+\frac{(e_3-o_3)^2}{e_3}+\frac{(e_4-o_4)^2}{e_4} \]
For example, for \( i=4 \) (Cell 4): \( e_4 = \frac{2\cdot 2}{5} = 0.8 \), \( o_4 = 0 \)
Plugging in all four cells, we have: \[ \chi^2 = 2.222 \]
| Split | \( \chi^2 \) |
|---|---|
| Split 1 | 1.875 |
| Split 2 | 0.139 |
| Split 3 | 2.222 |
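
The three statistics can be reproduced from the contingency tables alone. The following sketch is one possible implementation, not a library API: the function names are hypothetical, each table is a list of rows `[greens, reds]` for the left and right branch, and every expected count is assumed to be non-zero.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson statistic: sum over all cells of (e - o)^2 / e."""
    expected = expected_counts(table)
    return sum((e - o) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Observed counts: rows are the left and right branch, columns are greens and reds.
tables = {
    "Split 1": [[0, 1], [3, 1]],
    "Split 2": [[1, 1], [2, 1]],
    "Split 3": [[1, 2], [2, 0]],
}
chi2 = {name: chi_square(table) for name, table in tables.items()}
for name, value in chi2.items():
    print(name, round(value, 3), "df =", degrees_of_freedom(tables[name]))
```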
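
Each table is \( 2 \times 2 \), so the statistic has \( (2-1)(2-1) = 1 \) degree of freedom. The p-value is the upper-tail probability \( P(\chi^2_1 \ge \chi^2) \), and logworth \( = -\log_{10}(\text{p-value}) \):

| Split | \( \chi^2 \) | p-value | logworth |
|---|---|---|---|
| Split 1 | 1.875 | 0.171 | 0.767 |
| Split 2 | 0.139 | 0.709 | 0.149 |
| Split 3 | 2.222 | 0.136 | 0.866 |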
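
A p-value and logworth can be obtained from a \( \chi^2 \) statistic with the standard library alone. The sketch below assumes one degree of freedom (a 2 × 2 table), takes the p-value to be the upper-tail probability, and takes logworth to be \( -\log_{10} \) of the p-value; the function names are illustrative.

```python
import math

def chi2_p_value_df1(chi2):
    """Upper-tail probability P(X >= chi2) for X ~ chi-square with 1 degree of freedom.

    With 1 degree of freedom X is the square of a standard normal Z,
    so P(X >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2))

def logworth(p_value):
    """Logworth, taken here as -log10(p-value)."""
    return -math.log10(p_value)

# Sanity check against the textbook 5% critical value for 1 degree of freedom.
print(round(chi2_p_value_df1(3.841), 3))  # 0.05

chi2_values = {"Split 1": 1.875, "Split 2": 0.139, "Split 3": 2.222}
scores = {name: logworth(chi2_p_value_df1(x)) for name, x in chi2_values.items()}
best_split = max(scores, key=scores.get)
print(best_split)  # Split 3
```

Because the p-value decreases as the statistic grows, ranking by logworth gives the same order as ranking by \( \chi^2 \) when all splits share the same degrees of freedom.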
Greatest \( \chi^2 \) = lowest \( p \)-value = greatest logworth = best split
Split 3 is the best split!